Top-K Entity Resolution with Adaptive Locality-Sensitive Hashing

Authors

  • Vasilis Verroios
  • Hector Garcia-Molina
Abstract

Given a set of records, entity resolution algorithms find all the records referring to each entity. In this paper, we study the problem of top-k entity resolution: finding all the records referring to the k largest (in terms of records) entities. Top-k entity resolution is driven by many modern applications that operate over just the few most popular entities in a dataset. We propose a novel approach, based on locality-sensitive hashing (LSH), that can very rapidly and accurately process massive datasets. Our key insight is to adaptively decide how much processing each record requires to ascertain if it refers to a top-k entity or not: the less likely a record is to refer to a top-k entity, the less it is processed. The heavily reduced amount of processing for the vast majority of records that do not refer to top-k entities leads to significant speedups. Our experiments with web images, web articles, and scientific publications show a 2x to 25x speedup compared to the traditional approach for high-dimensional data.
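
The adaptive idea in the abstract can be illustrated with a small Python sketch. This is not the authors' algorithm, only a simplified illustration under stated assumptions: records are non-empty sets of tokens, similarity is approximated with MinHash built on `zlib.crc32`, and the function names (`minhash`, `bucket`, `top_k_clusters`) and seed parameters are hypothetical. Every record first receives one cheap hash; the more expensive signature is computed only for records in coarse buckets that are still large enough to contain a top-k cluster.

```python
import heapq
import zlib
from collections import defaultdict


def minhash(tokens, seed):
    # Smallest seeded CRC32 over the record's tokens (assumes tokens is non-empty).
    return min(zlib.crc32(f"{seed}:{t}".encode()) for t in tokens)


def bucket(records, ids, seeds):
    # Group record ids whose MinHash signatures agree on every seed.
    groups = defaultdict(list)
    for i in ids:
        signature = tuple(minhash(records[i], s) for s in seeds)
        groups[signature].append(i)
    return list(groups.values())


def top_k_clusters(records, k, coarse_seeds=(0,), fine_seeds=(1, 2, 3, 4)):
    # Stage 1: one cheap hash per record, applied to the whole dataset.
    coarse = bucket(records, range(len(records)), coarse_seeds)
    coarse.sort(key=len, reverse=True)

    refined = []
    for group in coarse:
        if len(refined) >= k:
            kth_size = heapq.nlargest(k, map(len, refined))[-1]
            # Refinement only splits buckets, so a coarse bucket no larger
            # than the current k-th cluster cannot change the answer.
            if len(group) <= kth_size:
                break
        # Stage 2: spend more hashing only on records in large buckets.
        refined.extend(bucket(records, group, fine_seeds))

    return sorted(refined, key=len, reverse=True)[:k]


records = (
    [{"a", "b", "c"}] * 5   # entity with 5 records
    + [{"d", "e"}] * 3      # entity with 3 records
    + [{"f"}, {"g"}]        # two singleton entities
)
clusters = top_k_clusters(records, k=2)
print([len(c) for c in clusters])
```

In this toy run the two singleton records never reach the second stage, which mirrors the abstract's point that records unlikely to belong to a top-k entity receive less processing. A realistic implementation would also need a proper similarity threshold and banding scheme, which this sketch omits.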


Similar resources

Graph-Parallel Entity Resolution using LSH & IMM

In this paper we describe graph-based parallel algorithms for entity resolution that improve over the map-reduce approach. We compare two approaches to parallelize a Locality Sensitive Hashing (LSH) accelerated, Iterative Match-Merge (IMM) entity resolution technique: BCP, where records hashed together are compared at a single node/reducer, vs an alternative mechanism (RCP) where comparison loa...


Towards a Scalable and Robust Entity Resolution - Approximate Blocking with Semantic Constraints

Entity resolution, or record linkage, is the process that identifies data records over one or more datasets which refer to the same real-world entity. To deal with large datasets, many real-life applications require scalable and high-quality entity resolution techniques. Blocking techniques can help to scale up the entity resolution process. Locality sensitive hashing (LSH) is an approximate bl...


An LSH Index for Computing Kendall's Tau over Top-k Lists

We consider the problem of similarity search within a set of top-k lists under the Kendall’s Tau distance function. This distance describes how related two rankings are in terms of concordantly and discordantly ordered items. As top-k lists are usually very short compared to the global domain of possible items to be ranked, creating an inverted index to look up overlapping lists is possible but...


Multi-Level Spherical Locality Sensitive Hashing For Approximate Near Neighbors

This paper introduces “Multi-Level Spherical LSH”: a parameter-free, multi-level, data-dependent Locality Sensitive Hashing data structure for solving the Approximate Near Neighbors Problem (ANN). This data structure is a modified version of a multi-probe adaptive querying algorithm, with the potential of achieving an O(np + t) query run time, for all inputs n where t <= n. Keywords—Locality Sensitiv...


LSH At Large - Distributed KNN Search in High Dimensions

We consider K-Nearest Neighbor search for high-dimensional data in large-scale structured Peer-to-Peer networks. We present an efficient mapping scheme based on p-stable Locality Sensitive Hashing to assign hash buckets to peers in a Chord-style overlay network. To minimize network traffic, we process queries in an incremental top-K fashion, leveraging a locality-preserving mapping to the pee...




Publication date: 2016